Randomized Experiments

PSCI 8357 - STAT II

Georgiy Syunyaev

Department of Political Science, Vanderbilt University

February 17, 2026

Overview

  • A randomized experiment is the gold standard for making causal inferences
  • Randomization of the treatment will make the treatment and control groups similar on average with respect to observed and unobserved covariates

  • Advantage 1: Identification is justified by the design of the experiment

    • We control the treatment assignment mechanism.
    • We do not need to make CIA-type assumptions.
  • Advantage 2: Estimation is simple

    • Difference-in-means (DiM) or some weighted averages of DiM.
  • Advantage 3: Inference is simple

    • We can again use the known treatment assignment mechanism as a “reasoned basis for inference”.
  • Many identification strategies in observational studies aim to mimic the logic of randomized experiments

Overview

  • Neyman Approach

    1. Causal estimand is the ATE.
    2. Standard analysis tools for most experiments.
    • Limitation 1: Asymptotic approximation is required for inference

      \(\rightsquigarrow\) Inference is not reliable with small sample size

    • Limitation 2: Variance can be complicated for complex experimental designs.

  • Fisherian Approach

    1. Focus on a sharp null hypothesis (no effect for every unit).
    2. Assumption-free (valid for any sample size).
    3. Flexible: can accommodate any complex experimental designs.
  • Note: Both are design-based inference (the primary source of randomness comes from treatment assignment)

Motivating Example

Example: Social Pressure Experiment


  • Voter turnout theories based on rational self-interested behavior generally fail to predict significant turnout unless they account for the utility that citizens receive from performing their civic duty.
  • Two aspects of this type of utility are intrinsic satisfaction from behaving in accordance with a norm and extrinsic incentives to comply with it.
  • Gerber, Green, and Larimer (2008) test intrinsic motives in a large-scale field experiment by applying varying degrees of extrinsic pressure on voters using a series of mailings to 180,002 households before the August 2006 primary election in Michigan.

    • \(Y_i\): Voted in the primary election (Outcome)
    • \(T_i\): Type of mailing (Treatment)

Example: Social Pressure Experiment


  • T1: Civic Duty
    • Encouraged to vote.
  • T2: Hawthorne
    • Encouraged to vote.
    • Told that researchers would be checking on whether they voted.
  • T3: Self
    • Encouraged to vote.
    • Told that whether one votes is a matter of public record.
    • Shown whether members of their own household voted in the last two elections.
  • T4: Neighbors
    • Like Self but in addition recipients are shown whether the neighbors on the block voted in the last two elections.

Example: Social Pressure Experiment

Example: Social Pressure Experiment




|  | Control (Not Mailed) | Civic Duty (Encouraged to Vote) | Hawthorne (Encouraged & Monitored) | Self (Encouraged, Monitored, Shown Own Past Voting) | Neighbors (Encouraged, Monitored, Shown Own & Others’ Past Voting) |
|---|---|---|---|---|---|
| Percent Voting | \(29.7\%\) | \(31.5\%\) | \(32.2\%\) | \(34.5\%\) | \(37.8\%\) |
| \(N\) of Individuals | \(191,243\) | \(38,218\) | \(38,204\) | \(38,218\) | \(38,201\) |



Basic Setup

Basic Setup for Randomized Experiment


  • Units: \(i \in \{1, \ldots, N\}\)

  • Treatment: \(T_i \in \{0, 1\}\), randomly assigned.

  • Potential outcomes: \(Y_i(0)\) and \(Y_i(1)\).

  • Observed outcome: \(Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)\) (consistency).

  • Treatment Assignment Mechanism:

    1. Complete randomization: Exactly \(N_1\) units are treated.
    2. Bernoulli (simple) randomization: Each unit is independently assigned to treatment with probability \(p\).
  • Randomization (complete or simple) implies

\[ \{Y_i(1), Y_i(0)\} \ {\mbox{$\perp\!\!\!\perp$}}\ T_i \]

Identification of ATE

  • Causal Estimand: Still ATE.

    \[ \tau_{ATE} \equiv {\mathbb{E}}\{Y_i(1) - Y_i(0)\} \]

  • Still not directly estimable as we don’t observe \(Y_i(1) - Y_i(0)\) for each unit

  • Identification Question: Can we write down \(\tau_{ATE}\) only with observed data (\(Y_i, T_i\))?

RESULT: Identification under Randomization

\[ \begin{aligned} {\mathbb{E}}\{Y_i(1) - Y_i(0)\} &= {\mathbb{E}}\{Y_i(1)\} - {\mathbb{E}}\{Y_i(0)\} \quad \text{($\because$ linearity of ${\mathbb{E}}$)} \\ &= {\mathbb{E}}\{Y_i(1) {\:\vert\:}T_i = 1\} - {\mathbb{E}}\{Y_i(0) {\:\vert\:}T_i = 0\} \quad \text{($\because$ randomization of $T_i$)} \\ &= {\mathbb{E}}[Y_i {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0] \quad \text{($\because$ consistency of PO)} \end{aligned} \]

  • Estimation Question: Can we estimate \({\mathbb{E}}[Y_i {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0]\)?

\[ \frac{1}{N_1} \sum_{i=1}^N T_i Y_i - \frac{1}{N_0} \sum_{i=1}^N (1 - T_i) Y_i \]
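As a quick sanity check, here is a minimal simulation sketch (purely illustrative, not from the lecture code) in which the difference-in-means above recovers a known ATE under complete randomization:

```r
# Illustrative simulation: DiM recovers the ATE under randomization
set.seed(123)
N  <- 1e5
Y0 <- rnorm(N, mean = 1)            # potential outcome under control
Y1 <- Y0 + 2                        # true ATE is 2 by construction
Tr <- sample(rep(c(0, 1), N / 2))   # complete randomization: N/2 treated
Y  <- Tr * Y1 + (1 - Tr) * Y0       # consistency: observed outcome
dim_est <- mean(Y[Tr == 1]) - mean(Y[Tr == 0])
dim_est                             # close to 2
```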

Without Randomization

  • Without randomization we have \(\{Y_i(1), Y_i(0)\} {\mbox{$\centernot{\perp\!\!\!\perp}$}}T_i\).
  • This implies \[ {\mathbb{E}}\{Y_i(1)\} \neq {\mathbb{E}}\{Y_i(1) {\:\vert\:}T_i = 1\}, \quad {\mathbb{E}}\{Y_i(0)\} \neq {\mathbb{E}}\{Y_i(0) {\:\vert\:}T_i = 0\} \]

    • e.g., people who read newspapers are more interested in politics.
  • Without randomization, treatment and control groups are different with respect to pre-treatment covariates.

  • Pre-treatment covariates: Variables that are not affected by the treatment.

  • Importantly, potential outcomes are pre-treatment covariates!

  • Note: observed outcomes are post-treatment variables!

  • Intuition: Randomization makes treatment and control groups similar on average with respect to all observed and unobserved pre-treatment covariates.

The Struggle is Real

Estimation of SATE

Design-Based Inference

  • Consider finite population and focus on design-based inference

    • Essentially: focus only on the randomness induced by the treatment assignment.
    • e.g. finite-population inference or (later) randomization inference.
  • Treatment variables \((T_1, \ldots, T_N)\) are random.

  • Units and potential outcomes (\(Y_i(1), Y_i(0)\)) are fixed.

  • We now distinguish Sample Average Treatment Effect (SATE):

    \[ \tau_{SATE} \equiv \frac{1}{N} \sum_{i=1}^N \{Y_i(1) - Y_i(0)\} \]

  • Randomization is the “reasoned basis for inference” (Fisher 1936)

  • Design-based inference:

    • Advantage: Rely only on the treatment assignment mechanism that researchers control instead of untestable distributional assumptions (e.g., i.i.d data, normal errors, outcome models).
    • Disadvantage: sometimes less flexible

Unbiasedness of Difference-in-Means

  • Difference-in-Means (DiM) estimator is \(\widehat{\tau}_{DiM} \equiv \frac{1}{N_1} \sum_{i=1}^N T_i Y_i - \frac{1}{N_0} \sum_{i=1}^N (1 - T_i) Y_i\)
  • Unbiased for the SATE under complete randomization; no modeling assumption is needed.

  • First, condition on the set of potential outcomes \(\mathcal{O}_N = \{Y_i(1), Y_i(0)\}_{i=1}^N\); then:

\[ \begin{aligned} {\mathbb{E}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i Y_i {\:\vert\:}\mathcal{O}_N] - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[(1 - T_i) Y_i {\:\vert\:}\mathcal{O}_N] \quad \text{($\because$ linearity of ${\mathbb{E}}$)} \\ &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i Y_i(1) {\:\vert\:}\mathcal{O}_N] - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[(1 - T_i) Y_i(0) {\:\vert\:}\mathcal{O}_N] \quad \text{($\because$ consistency of PO)} \\ &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i {\:\vert\:}\mathcal{O}_N] Y_i(1) - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[1 - T_i {\:\vert\:}\mathcal{O}_N] Y_i(0) \quad \text{($\because$ POs are fixed)} \\ &= \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0) \quad \text{($\because$ complete randomization)} \end{aligned} \]
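The derivation above can be checked numerically. In this illustrative sketch the potential outcomes are held fixed while only the treatment assignment is re-randomized, and the average of the DiM across assignments matches the SATE:

```r
# Illustrative check: DiM is unbiased for the SATE over re-randomizations
set.seed(42)
N <- 40; N1 <- 20
Y1 <- rnorm(N, mean = 1); Y0 <- rnorm(N)    # fixed potential outcomes
sate <- mean(Y1 - Y0)
dim_reps <- replicate(10000, {
  Tr <- sample(rep(c(1, 0), c(N1, N - N1))) # complete randomization
  Y  <- Tr * Y1 + (1 - Tr) * Y0
  mean(Y[Tr == 1]) - mean(Y[Tr == 0])
})
c(sate = sate, mean_dim = mean(dim_reps))   # nearly identical
```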

IPW Estimator as Generalization

  • Inverse Probability Weighting Estimator (Horvitz–Thompson estimator) for the SATE:

    \[ \widehat{\tau}_{IPW} \equiv \frac{1}{N} \sum_{i=1}^N \left\{\frac{T_iY_i}{p_i} - \frac{(1 - T_i) Y_i}{(1 - p_i)}\right\}, \]

    where \(p_i = {\textrm{Pr}}(T_i = 1 {\:\vert\:}\mathcal{O}_N)\).

  • Can prove that if probabilities of assignment are fixed, i.e. \(\forall i:\: p_i = p\), we have \({\mathbb{E}}[ \widehat{\tau}_{IPW} {\:\vert\:}\mathcal{O}_N ] = \tau_{SATE}\). (How would you approach this?) (proof)
  • This estimator is more general:

    1. DiM is a special case of IPW estimator when… \(\forall i:\: p_i = N_1/N\). Note: \(p_i\)’s do not have to be \(0.5\)!
    2. IPW allows us to account for complex designs with unequal probabilities of assignment across units.
    3. IPW estimator will show up in observational studies as well.
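To see points 1 and 2 in action, here is an illustrative sketch (simulated data, Bernoulli assignment with known unit-level probabilities) in which IPW stays near the SATE while the unweighted DiM is badly biased:

```r
# Illustrative: IPW vs DiM under unequal known assignment probabilities
set.seed(7)
N  <- 1000
p  <- ifelse(seq_len(N) <= N / 2, 0.2, 0.8)   # known unit-level probabilities
mu <- c(rep(0, N / 2), rep(3, N / 2))         # high-outcome units more likely treated
Y1 <- rnorm(N, mean = mu + 1)                 # SATE is about 1
Y0 <- rnorm(N, mean = mu)
sate <- mean(Y1 - Y0)
est <- replicate(5000, {
  Tr <- rbinom(N, 1, p)                       # Bernoulli randomization
  Y  <- Tr * Y1 + (1 - Tr) * Y0
  c(ipw = mean(Tr * Y / p - (1 - Tr) * Y / (1 - p)),
    dim = mean(Y[Tr == 1]) - mean(Y[Tr == 0]))
})
rowMeans(est)                                 # ipw near sate; dim far above it
```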

Finite Population Correction

  • Under complete randomization, treatment is assigned without replacement: exactly \(N_1\) of \(N\) units are treated
  • This is analogous to simple random sampling without replacement (SRSWOR) from a finite population
  • For SRSWOR of \(n\) units from a population of \(N\), the variance of the sample mean is:

\[ {\mathbb{V}}[\overline{Y}] = \frac{S^2}{n} \underbrace{\frac{N - n}{N}}_{FPC} \]

  • Intuition: The factor \(\frac{N-n}{N} = 1 - \frac{n}{N}\) reduces variance because sampling without replacement exhausts the population: the larger the sampled fraction \(n/N\), the closer \(\overline{Y}\) must be to the population mean

    • When \(n \ll N\): FPC \(\approx 1\) (negligible correction)
    • When \(n = N\): FPC \(= 0\) (no randomness left: a census!)
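A quick numerical check of the FPC formula (illustrative code; a fixed simulated vector plays the role of the finite population):

```r
# Illustrative: variance of the sample mean under SRS without replacement
set.seed(1)
N <- 100; n <- 60
pop <- rnorm(N)                            # the fixed finite population
S2  <- var(pop)                            # S^2 with N - 1 denominator
v_theory <- S2 / n * (N - n) / N           # FPC-adjusted variance
v_sim <- var(replicate(20000, mean(sample(pop, n))))  # sampling w/o replacement
c(theory = v_theory, simulated = v_sim)    # should agree closely
```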

Variance of DiM


  • Using the FPC for each group mean, \(\overline{Y}_t = \frac{1}{N_t} \sum_{j=1}^N \mathbb{1} [T_j = t] Y_j\), we can derive the variance of the DiM:

\[ \begin{align*} {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] &= {\mathbb{V}}[\overline{Y}_1 {\:\vert\:}\mathcal{O}_N] + {\mathbb{V}}[\overline{Y}_0 {\:\vert\:}\mathcal{O}_N] - 2 {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0 {\:\vert\:}\mathcal{O}_N) \quad \text{($\because$ variance of difference)} \\ &= \frac{S_1^2}{N_1} \frac{N - N_1}{N} + \frac{S_0^2}{N_0} \frac{N - N_0}{N} + 2 \frac{\textcolor{#d65d0e}{S_{10}}}{N} \quad \text{($\because$ \textit{FPC};}\; {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0) = -S_{10}/N \href{#proof-cov-neg}{\text{(proof)}}\text{)} \\ &= \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_1^2 + S_0^2 -2 \textcolor{#d65d0e}{S_{10}}}{N} \quad \text{($\because$ rearranging terms)} \\ &= \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{\textcolor{#d65d0e}{S_{\tau}^2}}{N} \quad \text{($\because$ variance of difference)} \end{align*} \]

  • \(S_{t}^2\): Sample variance of \(Y_i(t)\) for \(t \in \{0,1\}\) \(\rightarrow\) identified
  • \(S_{10}\): Sample covariance of \(Y_i(1)\) and \(Y_i (0)\) \(\rightarrow\) unidentified
  • \(S^2_{\tau}\): Sample variance of \(Y_i(1) - Y_i(0)\) \(\rightarrow\) unidentified

What about Variance of DiM?

  • How do we deal with unidentified part? Be conservative
  • Conservative estimator: \({\mathbb{E}}[\widehat{\sigma}^2 {\:\vert\:}\mathcal{O}_N ] \geq {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N]\)

RESULT: Conservative Variance Estimator

\[ \widehat{\sigma}^2= \frac{1}{N_1} \widehat{S}_{1}^2 + \frac{1}{N_0} \widehat{S}_{0}^2 \] where \(\widehat{S}_{t}^2 = \frac{1}{N_t - 1} \sum_{i=1}^N \mathbb{1} [T_i = t] (Y_i - \overline{Y}_t)^2\), and in turn \(\overline{Y}_t = \frac{1}{N_t} \sum_{j=1}^N \mathbb{1} [T_j = t] Y_j\).

  • The bound is attained, i.e. \({\mathbb{E}}[\widehat{\sigma}^2 {\:\vert\:}\mathcal{O}_N] = {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N]\), when… treatment effects are constant across units (so that \(S_{\tau}^2 = 0\)).
  • Leads to conservative inferences:
    • Standard errors, \(\widehat{\sigma}\), will be in expectation at least as big as they should be.
    • Confidence intervals using \(\widehat{\sigma}\) will be in expectation at least as wide as they should be.
    • Type I (false-positive) error rates will still be controlled; power will be lower.
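The conservativeness is easy to see in a small simulation (illustrative code; effects are heterogeneous so that \(S_{\tau}^2 > 0\)):

```r
# Illustrative: E[sigma-hat^2] exceeds the true randomization variance
set.seed(5)
N <- 50; N1 <- 25
Y0 <- rnorm(N)
Y1 <- Y0 + rnorm(N, mean = 1)      # heterogeneous effects: S_tau^2 > 0
reps <- replicate(20000, {
  Tr <- sample(rep(c(1, 0), c(N1, N - N1)))
  Y  <- Tr * Y1 + (1 - Tr) * Y0
  c(dim  = mean(Y[Tr == 1]) - mean(Y[Tr == 0]),
    vhat = var(Y[Tr == 1]) / N1 + var(Y[Tr == 0]) / (N - N1))
})
c(true_var = var(reps["dim", ]), mean_vhat = mean(reps["vhat", ]))
# mean_vhat overshoots true_var by roughly S_tau^2 / N
```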

Estimation of PATE

SATE \(\rightarrow\) PATE



  • So far, we have assumed for simplicity that our data represent the entire population.

  • In reality, we often treat our experiment as a sample from the population.

  • Sampling introduces an additional layer of uncertainty in causal inference:

    1. \(N\) units are randomly sampled from the population.
    2. \(N_1\) units are then randomly assigned to the treatment.

    \(\rightsquigarrow\) Focus on Population ATE.

  • How does this affect our inference, in terms of:

    • Point estimates: Is observed difference in means (\(\widehat{\tau}\)) still unbiased?
    • Uncertainty estimates: Is the conventional standard error still valid?

Estimation of PATE


  • Assumption: simple random sampling from a super-population

    • \((Y_i(1), Y_i(0)) \stackrel{\rm i.i.d}{\sim}\) unknown super-population
  • Population Average Treatment Effect (PATE) is:

    \[ \tau_{PATE} \equiv {\mathbb{E}}[Y_i(1) - Y_i(0)] \]

  • DiM is unbiased (over repeated sampling and treatment assignment):

    \[ {\mathbb{E}}[\widehat{\tau}_{DiM}] = {\mathbb{E}}\left[ {\mathbb{E}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] \right] = {\mathbb{E}}[\tau_{SATE}] = {\mathbb{E}}[Y_i(1) - Y_i(0)] = \tau_{PATE} \]

    • Note: This requires a true random sampling from the population.
  • Important: Often obtaining such a sample is impossible \(\rightsquigarrow\) External Validity

    • In such a case: focus on SATE and interpret as such (estimate is still internally valid, but no longer externally valid)

Variance for PATE

  • Now let’s characterize total uncertainty (sampling + design) for the PATE, \({\mathbb{V}}[\widehat{\tau}_{DiM}]\).

  • Law of Total Variance: \(\underbrace{{\mathbb{V}}(Y)}_{\text{total variance}} = \underbrace{{\mathbb{E}}[{\mathbb{V}}(Y\mid X)]}_{\text{(mean of) "within" variance}} + \underbrace{{\mathbb{V}}({\mathbb{E}}[Y\mid X])}_{\text{"between" variance}}\) (see Angrist and Pischke 2009, Ch. 3).

  • Applying the LTV and denoting the population variance of \(Y_i(t)\) and \(\tau\) as \(\sigma_{t}^2\) and \(\sigma_{\tau}^2\):

    \[ \begin{align*} {\mathbb{V}}[\widehat{\tau}_{DiM}] &= {\mathbb{E}}\left[{\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N]\right] + {\mathbb{V}}\left[{\mathbb{E}}[\widehat{\tau}_{DiM}\mid \mathcal{O}_N]\right] \\ &= {\mathbb{E}}\left[\frac{S_{Y_1}^2}{N_1} + \frac{S_{Y_0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + {\mathbb{V}}[\tau_{SATE}] \\ &= {\mathbb{E}}\left[\frac{S_{Y_1}^2}{N_1} + \frac{S_{Y_0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + {\mathbb{V}}\left[\frac{1}{N}\sum_{i \in \mathcal{O}_N}\tau_i\right] \\ &= {\mathbb{E}}\left[\frac{S_{Y_1}^2}{N_1} + \frac{S_{Y_0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + \frac{\sigma_{\tau}^2}{N} \\ &= \frac{\sigma_{1}^2}{N_1} + \frac{\sigma_{0}^2}{N_0}. \end{align*} \]

Variance for PATE


  • Note: The same variance estimator is unbiased for the variance of the difference-in-means as an estimator for the PATE. \[ {\mathbb{E}}[\widehat{\sigma}^2] = {\mathbb{V}}[\widehat{\tau}_{DiM}] = \frac{1}{N_1} \sigma_{1}^2 + \frac{1}{N_0} \sigma_{0}^2 \]
  • This is in contrast to the result for the SATE:

    \[ {\mathbb{E}}[\widehat{\sigma}^2 {\:\vert\:}\mathcal{O}_N] \geq {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] \]

  • Intuition: This variance estimator was always too large for the SATE: we overestimated the variability.

  • But for PATE, because we have additional uncertainty, this becomes an unbiased estimator.

    • Further reading: Imbens and Rubin (2015, Ch. 6).

Neyman Inference for SATE or PATE

Weak Null

  • “Weak” null hypothesis (for PATE) (Neyman):

    \[ H_0^{\text{weak}}: {\mathbb{E}}[Y_i(1) - Y_i(0)] = 0 \quad \text{vs.} \quad H_a^{\text{weak}}: {\mathbb{E}}[Y_i(1) - Y_i(0)] \neq 0 \quad (\text{two-sided}) \]

  • PATE

    • Consistency via the law of large numbers: \(\widehat{\tau}_{DiM} \xrightarrow{p} \tau_{PATE}\).
    • Asymptotic normality via the CLT

    \[ \frac{\widehat{\tau}_{DiM}- \tau_{PATE}}{\sqrt{\sigma^2_1/N_1 + \sigma_0^2/N_0}} \xrightarrow{d} \mathcal{N}(0, 1) \]

  • SATE

    \[ \frac{\widehat{\tau}_{DiM}- \tau_{SATE}}{\sqrt{S^2_1/N_1 + S_0^2/N_0 - S^2_{\tau}/N}} \xrightarrow{d} \mathcal{N}(0, 1) \]

  • \((1 - \alpha) \times 100\)% CI: \([\widehat{\tau}_{DiM} - \widehat{\sigma} \times z_{1 - \alpha/2}, \widehat{\tau}_{DiM} + \widehat{\sigma} \times z_{1 - \alpha/2}]\)

Neyman Inference with Regression

  • For a binary treatment (\(T_i \in \{0,1\}\)) we can show:

    • Simple regression coefficient is numerically equal to DiM:

    \[ \widehat\beta_{OLS} \ \equiv\ \frac{\widehat{{\mathrm{cov}}}(Y_i, T_i)}{\widehat{{\mathbb{V}}}(T_i)} \ =\ \widehat{\tau}_{DiM} \]

    • Heteroskedasticity-robust variance (the \(HC2\) variant) is also numerically equal to the conservative Neyman variance:

    \[ \widehat\sigma^2_{HC2} \ = \frac{1}{N_1} \widehat{S}_1^2 + \frac{1}{N_0} \widehat{S}_0^2 \]

  • In practice in a completely randomized experiment:

    1. Regress \(Y_i\) on \(T_i\) (w/ intercept) and get the coefficient on \(T_i\).
    2. Calculate the robust standard error (estimatr::lm_robust()).
    3. Calculate confidence intervals, etc. as usual.

Neyman in Practice: Point Estimates


# load packages
pacman::p_load(
  tidyverse,
  labelled,
  haven,
  estimatr,
  sandwich
)

# load data
gerber <- haven::read_dta("../_data/gerber.dta")

# check how treatment and outcome are coded
# labelled::get_value_labels(gerber$treatment)
# labelled::get_value_labels(gerber$voted)

# calculate difference-in-means by hand
est_dim <-
  gerber |>
    (
      \(.)
        c(
          hawthorne = mean(.$voted[.$treatment == 1]) -
            mean(.$voted[.$treatment == 0]),
          civic = mean(.$voted[.$treatment == 2]) -
            mean(.$voted[.$treatment == 0]),
          neighbor = mean(.$voted[.$treatment == 3]) -
            mean(.$voted[.$treatment == 0]),
          self = mean(.$voted[.$treatment == 4]) -
            mean(.$voted[.$treatment == 0])
        )
    )()

# calculate difference-in-means using regression
est_lm <-
  estimatr::lm_robust(
    voted ~ factor(treatment),
    data = gerber
  ) |>
    estimatr::tidy() |>
    dplyr::pull(estimate)

bind_cols(
  treatment = names(est_dim),
  est_dim = unname(est_dim),
  est_lm = est_lm[-1]
) |>
  knitr::kable(digits = 3) |>
  kableExtra::kable_minimal(font_size = 20)
| treatment | est_dim | est_lm |
|---|---|---|
| hawthorne | 0.026 | 0.026 |
| civic | 0.018 | 0.018 |
| neighbor | 0.081 | 0.081 |
| self | 0.049 | 0.049 |

Neyman in Practice: Uncertainty


# calculate standard errors by hand
s_2 <-
  gerber |>
    (
      \(.)
        c(
          control = var(.$voted[.$treatment == 0]) /
            sum(.$treatment == 0),
          hawthorne = var(.$voted[.$treatment == 1]) /
            sum(.$treatment == 1),
          civic = var(.$voted[.$treatment == 2]) /
            sum(.$treatment == 2),
          neighbor = var(.$voted[.$treatment == 3]) /
            sum(.$treatment == 3),
          self = var(.$voted[.$treatment == 4]) /
            sum(.$treatment == 4)
        )
    )()

se_hand <- sapply(2:5, function(i) {
  sqrt(s_2[i] + s_2["control"])
})

# or calculate using the sandwich package
se_sandwich <-
  lm(voted ~ factor(treatment), data = gerber) |>
    vcovHC(type = "HC2") |>
    diag() |>
    sqrt()

# can get it directly now
se_robust <-
  estimatr::lm_robust(
    voted ~ factor(treatment),
    data = gerber,
    se_type = "HC2"
  ) |>
    estimatr::tidy() |>
    dplyr::pull(std.error)

bind_cols(
  treatment = names(s_2[-1]),
  se_hand = se_hand,
  se_sandwich = se_sandwich[-1],
  se_robust = se_robust[-1]
) |>
  knitr::kable(digits = 5) |>
  kableExtra::kable_minimal(font_size = 20)
| treatment | se_hand | se_sandwich | se_robust |
|---|---|---|---|
| hawthorne | 0.00261 | 0.00261 | 0.00261 |
| civic | 0.00259 | 0.00259 | 0.00259 |
| neighbor | 0.00269 | 0.00269 | 0.00269 |
| self | 0.00265 | 0.00265 | 0.00265 |

Fisher Inference for SATE

Foundations of Statistics

Lady Tasting Tea


  • Fisher (1936): Does tea taste different depending on whether the tea was poured into the milk or whether the milk was poured into the tea?
  • Lady Tasting Tea Experiment

    • Units: 8 identical cups

    • Randomization: Randomly choose 4 cups into which the tea is poured first, and for the other four, the milk was poured first

    • Null hypothesis: the lady cannot tell the difference

    • Statistic: the number of correctly classified cups

    • Outcome: The lady classified all 8 cups correctly!

  • Did this happen by chance?

Permutation Test

  • \(\binom{8}{4}= \frac{8!}{4!(8-4)!} = 70\) ways to choose which 4 cups are tea-first, each corresponding to a number of correct guesses

  • Only \(1\) of them corresponds to guessing all cups correctly!

  • Under the \(H_0\) (lady guessing at random), the probability that the lady classifies all cups correctly is \(1/70 \approx 0.014\).

  • \(p = {\textrm{Pr}}(\text{guessing all cups correctly}\) \({\:\vert\:}\text{guessing at random}) = 0.014\) \(\rightarrow\) Reject the null of guessing at random!
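The 70 possible guesses can be enumerated directly; a minimal sketch:

```r
# Enumerate all 70 ways to guess which 4 of 8 cups were tea-first
truth <- c(1, 1, 1, 1, 0, 0, 0, 0)            # true tea-first cups
assignments <- combn(8, 4)                    # each column: one guess of 4 cups
correct <- apply(assignments, 2, function(g) {
  guess <- as.integer(seq_len(8) %in% g)
  sum(guess == truth)                         # cups classified correctly
})
p_value <- mean(correct == 8)                 # exactly 1/70
p_value
```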

Basic Setup for Fisher’s Exact Test


  • Units: \(i \in \{1, \ldots, N\}\)
  • Treatment: \(T_i \in \{0, 1\}\), randomly assigned
  • Potential outcomes: \(Y_i(0)\) and \(Y_i(1)\)
  • Observed outcome: \(Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)\) (consistency)
  • Treatments are assigned with some known treatment assignment mechanism
    • If researchers can reproduce it, fine to be very complicated
    • e.g., complete, Bernoulli, Block, Cluster, or any complex randomization
  • “Sharp” null hypothesis of no treatment effect: \[ H^{sharp}_0: Y_i(1) = Y_i(0) \: \forall i \quad \text{vs.} \quad H^{sharp}_a: \exists i:\: Y_i(1) \neq Y_i(0) \]
  • Very different from the “weak” null hypothesis!

Sharp Null


  • Fisher’s sharp null hypothesis: \(Y_i(1) = Y_i(0)\) for all units.

  • Key idea: Under the sharp null, we “observe” all potential outcomes!

  • We can compute the exact \(p\)-value to test this sharp null hypothesis.

  • Example: GOTV experiments:
| Voters \(i\) | Contact \(T_i\) | Turnout \(Y_i\) | Potential Turnout \(Y_i(1)\) | Potential Turnout \(Y_i(0)\) |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | ? |
| 2 | 0 | 0 | ? | 0 |
| 3 | 1 | 1 | 1 | ? |
| 4 | 1 | 0 | 0 | ? |
| 5 | 0 | 1 | ? | 1 |
  • Estimate: \(\widehat{\tau} = \frac{2}{3} - \frac{1}{2} = \frac{1}{6}\)

  • Is this statistically significant? How do we compute \(p\)-value?

Computing the Null Distribution


| Voters \(i\) | Turnout \(Y_i\) | Contact \(T_i\) | \(\widetilde{T}^1_i\) | \(\widetilde{T}^2_i\) | \(\widetilde{T}^3_i\) | \(\ldots\) |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | \(\ldots\) |
| 2 | 0 | 0 | 1 | 1 | 0 | \(\ldots\) |
| 3 | 1 | 1 | 1 | 0 | 1 | \(\ldots\) |
| 4 | 0 | 1 | 0 | 1 | 0 | \(\ldots\) |
| 5 | 1 | 0 | 0 | 0 | 1 | \(\ldots\) |
| \(\widehat{\tau}\) |  | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(-\frac{2}{3}\) | \(1\) | \(\ldots\) |
  • The null (\(\approx\) sampling) distribution of the test statistic is \(\{\widehat{\tau}_k\}_{k=1}^K\), where \[ \widehat{\tau}_k = \frac{\sum_{i=1}^N \widetilde{T}^k_i Y_i}{\sum_{i=1}^N \widetilde{T}^k_i} - \frac{\sum_{i=1}^N (1-\widetilde{T}^k_i) Y_i}{\sum_{i=1}^N (1-\widetilde{T}^k_i)} \]

  • Exact (two-sided) \(p\)-value is \(p = \frac{1}{K} \sum_{k=1}^K \mathbb{1} \left[ |\widehat{\tau}_k| \geq |\widehat{\tau}|\right]\), where \(\widehat{\tau}\) is the observed test statistic

  • Note: If \(K\) (the number of possible treatment assignments) is large, use simulations!
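For the 5-voter example above, the whole reference distribution can be enumerated (a sketch using the \(\geq\) convention, which counts the observed assignment itself). With only \(\binom{5}{3} = 10\) possible assignments, every \(|\widehat{\tau}_k|\) is at least \(1/6\), so the exact \(p\)-value is 1: there is essentially no power with so few units.

```r
# Exact randomization distribution for the 5-voter GOTV example
Y <- c(1, 0, 1, 0, 1)                           # turnout, fixed under the sharp null
t_obs <- mean(Y[c(1, 3, 4)]) - mean(Y[c(2, 5)]) # observed DiM = 1/6
perms <- combn(5, 3)                            # all 10 ways to treat 3 of 5 voters
tau_k <- apply(perms, 2, function(s) mean(Y[s]) - mean(Y[-s]))
p_exact <- mean(abs(tau_k) >= abs(t_obs))
c(t_obs = t_obs, p_exact = p_exact)             # p = 1 here
```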

Example: Social Pressure Experiment


# load data
gerber <- haven::read_dta("../_data/gerber.dta")

# observed test statistics
lm_obs <- lm(voted ~ factor(treatment), data = gerber)
obs_dim <- coef(lm_obs)[2:5]

# Fisher’s exact test
sim_dim <-
  pbapply::pbreplicate(1000, {
    sim_treatment <-
      sample(gerber$treatment,
        size = length(gerber$treatment), replace = FALSE
      )
    lm_sim <- lm(gerber$voted ~ factor(sim_treatment))
    coef(lm_sim)[2:5]
  }, cl = 8)

# p-values
mean(abs(sim_dim[1, ]) >= abs(obs_dim[1])) # two-sided for Hawthorne
mean(abs(sim_dim[2, ]) >= abs(obs_dim[2])) # two-sided for Civic
mean(abs(sim_dim[3, ]) >= abs(obs_dim[3])) # two-sided for Neighbors
mean(abs(sim_dim[4, ]) >= abs(obs_dim[4])) # two-sided for Self

Example: Results

[Figures: randomization distributions of the DiM test statistic for the Hawthorne, Neighbors, Civic Duty, and Self treatments.]

Test Statistics

  • Fisher’s test is also flexible with respect to the choice of test statistic.
  1. Difference-in-Means (or an estimator of the ATE):

    • Under the sharp null, this test statistic has mean zero.
    • Easy to interpret.
    • Disadvantage: The power might be lower than alternatives.
  2. Difference-in-Mean-Ranks (for continuous outcomes):

    \[ S = \left|\frac{\sum_{i=1}^N T_i R_i}{\sum_{i=1}^N T_i} - \frac{\sum_{i=1}^N (1-T_i) R_i}{\sum_{i=1}^N (1-T_i)} \right| \]

    • \(R_i\): rank of unit \(i\)’s observed outcome \(Y_i\) among \((Y_1, \ldots, Y_N)\).
    • Advantages: Reference distribution does not depend on scale and is not sensitive to outliers.
  • To learn more read Imbens and Rubin (2015, Ch. 5).

General Procedure for Fisher’s Exact Test


  1. Specify a sharp null hypothesis

    • \(H_0: Y_i(1) - Y_i(0) = \tau_{0i}\), where we set some constant \(\tau_{0i}\) for each \(i\) (commonly you set \(\tau_{0i} = 0\)).
    • No effect implies no heterogeneous effect, no spillover effect, etc.
  2. Choose a test statistic \(S = f(\{Y_i, T_i, \tau_{0i}\}_{i=1}^N)\)

    • Difference-in-Means, Difference-in-Mean-Ranks, etc.
    • Any statistic gives a valid and exact \(p\)-value but power may differ
    • Could use regression models or machine learning algorithms
  3. Compute the reference distribution and \(p\)-value based on the randomized distribution of treatment assignment
    • Exact distribution in small samples
    • Monte Carlo approximation as a general strategy

Application Beyond Randomized Experiments

  • The California Alphabet Lottery (Ho and Imai 2006).

  • Randomization sometimes occurs in the real world.

  • Started in 1975: “[B]oth the ‘incumbent first’ and ‘alphabetical order’ procedures are constitutionally impermissible.” (Gould v. Grubb, 14 Cal. 3d 661, 676).

  • A random alphabet is drawn for every statewide election that applies to all statewide offices.

  • Candidates are ordered by this randomized alphabet for the first of 80 assembly districts and are rotated for each subsequent assembly district.

Lottery Drawing

Lottery Canisters

California Elections Code 13112(a)




“Each letter of the alphabet shall be written on a separate slip of paper, each of which shall be folded and inserted into a capsule. Each capsule shall be opaque and of uniform weight, color, size, shape, and texture. The capsules shall be placed in a container, which shall be shaken vigorously in order to mix the capsules thoroughly. The container then shall be opened and the capsules removed at random one at a time. As each is removed, it shall be opened and the letter on the slip of paper read aloud and written down. The resulting random order of letters constitutes the randomized alphabet, which is to be used in the same manner as the conventional alphabet in determining the order of all candidates in all elections. For example, if two candidates with the surnames Campbell and Carlson are running for the same office, their order on the ballot will depend on the order in which the letters M and R were drawn in the randomized alphabet drawing.”

Fisher’s Exact Test for the Natural Experiment

  • Take into account the complex lottery procedure

    1. Randomize alphabet.
    2. Sort candidates by randomized alphabet.
    3. Rotate the candidate order.

    \(\rightsquigarrow\) Impossible via model-based inference!

  • Ho and Imai (2006) rely on data from the 2003 CA Gubernatorial Recall Election

    • 135 candidates.
    • Ballot order differs across 80 districts.
    • Analysis restricted to assembly district/county pairs whose ballots run more than one page.
  • Setup:

    • Null hypothesis: no causal effect of being on first ballot page for any candidate.
    • Test statistic: DiM between being on the first ballot page and being on other ballot pages.
    • Reference (randomization) distribution via Monte Carlo simulations.

Distribution of Exact \(p\)-values across Candidates

Randomized alphabet (2003 recall election):

R W Q O J M V A H B S G Z X N T C I E K U P D Y F L

Diagnose Design via Placebo Tests


  • Placebo tests: used when effects are known to be zero
  • Null hypothesis is assumed to be true
  • Ballot order should not affect pre-election covariates


Practical Considerations and Extensions


  • Practical Considerations

    • In most experiments, researchers focus on the ATE and use the Neyman approach to construct confidence intervals.

    • This is partly because a sharp null hypothesis is often not of substantive interest to social scientists.

    • Consider the Fisherian exact test

      1. When you have a small sample size (avoid if possible!), or

      2. When you have a complex (but known!) treatment assignment mechanism (e.g., natural experiment).

Imbalance

Covariates in Experiments?

Treatment Imbalance

  • Randomization balances both observed and unobserved pre-treatment covariates between the treated and untreated in large samples \(\rightsquigarrow\) covariate imbalance is generally not a concern.
  • But
    • In small samples, you may get unlucky and suffer from imbalance.
    • In a “natural” randomized experiment, it’s important to check whether randomization occurred as you thought.
  • Common practice: Conduct balance checks with respect to observed pre-treatment covariates.

    • Compare means, standard deviations, etc., between the treated and untreated.
    • Can also regress treatment indicator on covariates.
    • Visual inspection of histograms/density plots.
    • Many packages have balance tests, e.g. RItools::xBalance().
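A minimal sketch of such a balance check on simulated data (all variable names here are illustrative, not from the Gerber data):

```r
# Illustrative balance check: compare means by arm, then a joint F-test
set.seed(11)
N      <- 500
age    <- rnorm(N, mean = 50, sd = 10)    # hypothetical pre-treatment covariates
income <- rexp(N, rate = 1 / 40000)
Tr     <- sample(rep(c(0, 1), N / 2))     # complete randomization
by_arm <- aggregate(cbind(age, income), by = list(Tr = Tr), FUN = mean)
fstat  <- summary(lm(Tr ~ age + income))$fstatistic  # regress treatment on covariates
p_bal  <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
by_arm                                    # group means should be similar
p_bal                                     # joint p-value for imbalance
```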

What If You Found Imbalance?



  • Can correct imbalance via regression, matching, weighting, etc. (more on matching and weighting later).

  • Covariate adjustment can also improve efficiency, that is, reduce the randomization/sampling variance of our estimate of \(\tau\) while maintaining consistency.

  • But it may also produce bias, such as:

    • Adjusting for covariates that are weakly predictive of outcomes in small samples.
    • Bias due to post-hoc analysis (p-hacking).
    • Bias due to incorrectly adjusting for post-treatment covariates.

Covariate Adjustment in Experiments


  • Is it a good idea to control for pre-treatment covariates via linear regression in a randomized experiment? We need to look at the bias-variance tradeoff.
  • Potential benefit: If \(X_i\) predicts \(Y_i\), including it as a covariate, \(Y_i = \alpha + \tau T_i + X_i^\prime \gamma + \varepsilon_i\), reduces residual variance \(\Rightarrow\) smaller SEs \(\Rightarrow\) precision gain.
  • Pitfall 1 — Overfitting: In cases where \(X_i\) is weakly predictive of \(Y_i\), naive adjustment introduces additional noise and can even bias estimates in small samples (Freedman 2008).
  • Pitfall 2 — Common-slope constraint: Naive adjustment forces parallel regression lines with respect to \(X\) and limits precision gains.

    • Natural fix is full interaction model, \(Y_i = \alpha + \tau T_i + X_i^\prime \gamma + T_i X_i^\prime \xi\), to allow different slopes. But this induces \({\mathrm{cov}}(T_i, T_i \tilde{X}_i) \neq 0\) and \(\tau\) is the treatment effect at \(X = 0\), not the ATE.

Covariate Adjustment with Regression


  • Lin (2013) proposes demeaning covariates and including interaction terms:

\[ Y_i = \alpha + \tau T_i + (X_i - \overline{X})' \gamma + T_i (X_i - \overline{X})' \delta + \varepsilon_i \]

  • This addresses both pitfalls:

    1. Interactions \(\Rightarrow\) remove common-slope constraint \(\Rightarrow\) flexible fit
    2. Demeaning \(\Rightarrow\) \(\tau\) is the effect at \(\overline{X}\) (\(\approx\) ATE); \({\mathrm{cov}}(T_i, T_i \tilde{X}_i) = 0\) by construction. (proof)
    3. Variance guarantee: asymptotic variance of \(\widehat{\tau}_{Lin}\) \(\leq\) variance of DiM \(\Rightarrow\) you can only gain precision, never lose it.
    4. Bias is \(O(1/N)\) and decreases with sample size \(\Rightarrow\) consistent; use HC2 robust SEs for valid inference.
  • Further reading: Lei and Ding (2021) propose a procedure to debias Lin’s estimator even in small samples.

Lin (2013) Covariate Adjustment


# simulate data
set.seed(972)

N <- 100
X <- rnorm(N, mean = 3)
D <- randomizr::complete_ra(N)
Y <-
  2 + 1.5 * D + 0.1 * X +
  3 * D * (X - mean(X)) + rnorm(N, sd = 1.5)

# simple difference-in-means
dim_model <-
  estimatr::lm_robust(Y ~ D) |> estimatr::tidy()

# naive covariate adjustment (common slopes)
naive_model <-
  estimatr::lm_robust(Y ~ D + X) |> estimatr::tidy()

# demean covariates
X_centered <- X - mean(X)

# Lin estimator (hand-coded)
adjusted_model <-
  estimatr::lm_robust(Y ~ D * X_centered) |> estimatr::tidy()

# Lin estimator (estimatr)
adjusted_model2 <-
  estimatr::lm_lin(Y ~ D, covariates = ~X) |>
  estimatr::tidy()

dplyr::bind_rows(
  dim_model, naive_model, adjusted_model, adjusted_model2
) |>
  dplyr::filter(term == "D") |>
  dplyr::mutate(model = c("unadjusted", "naive", "Lin (hand)", "Lin (estimatr)")) |>
  dplyr::select(model, term, estimate, std.error) |>
  knitr::kable(digits = 5, align = "lccc") |>
  kableExtra::kable_minimal(font_size = 22)
| model          | term | estimate | std.error |
|:---------------|:----:|:--------:|:---------:|
| unadjusted     | D    | 1.43687  | 0.50032   |
| naive          | D    | 1.52321  | 0.46324   |
| Lin (hand)     | D    | 1.55665  | 0.30858   |
| Lin (estimatr) | D    | 1.55665  | 0.30858   |

Lin (2013) Covariate Adjustment



Figure: fitted regression lines for the three specifications: DiM (Y ~ D), Naive (Y ~ D + X), and Lin (Y ~ D * (X − X̄)).

Experimental Design Considerations

Unequal Probabilities of Assignment



  • In some experiments, units may have different probabilities of being assigned to the treatment group.

  • This can occur due to:

    • Block (stratified) randomization with different probabilities within strata or unequal-sized clusters.

    • Practical constraints or ethical considerations.

    • Adaptive designs where probabilities change over time (Offer-Westort, Coppock, and Green 2021).

  • Problem: Unequal probabilities can bias the Difference-in-Means (DiM) estimator.

  • Solution: Use the Inverse Probability Weighting (IPW) estimator that we introduced before to account for unequal probabilities.

Bias in Difference-in-Means

  • Recall the DiM estimator: \(\widehat{\tau}_{DiM} = \frac{1}{N_1} \sum_{i=1}^N T_i Y_i - \frac{1}{N_0} \sum_{i=1}^N (1 - T_i) Y_i\)

  • Suppose \(p_i = {\textrm{Pr}}(T_i = 1)\) varies across units, but we have complete random assignment (\(N_1\) and \(N_0\) are fixed).

  • The expected value of the DiM estimator is:

\[ \begin{align*} {\mathbb{E}}[\widehat{\tau}] &= {\mathbb{E}}\left[ \frac{1}{N_1}\sum_{i=1}^N T_i Y_i(1) \right] - {\mathbb{E}}\left[ \frac{1}{N_0}\sum_{i=1}^N (1 - T_i) Y_i(0) \right] \\ &= \sum_{i=1}^N {\mathbb{E}}\left[\frac{T_i}{N_1}\right] Y_i(1) - \sum_{i=1}^N {\mathbb{E}}\left[\frac{(1 - T_i)}{N_0}\right] Y_i(0) \\ &= \frac{\sum_{i=1}^N p_i\,Y_i(1)}{\sum_{j=1}^N p_j} - \frac{\sum_{i=1}^N (1-p_i)\,Y_i(0)}{\sum_{j=1}^N (1-p_j)} \\ &\neq \tau_{ATE}. \end{align*} \]

  • Intuition: The DiM over-weights observations with higher assignment probabilities and is generally not equal to the true ATE (unless \(p_i\) is constant for all units or \(\tau_i\) is constant).

Inverse Probability Weighting (IPW)


  • The IPW estimator corrects for unequal probabilities:

\[ \widehat{\tau}_{IPW} = \frac{1}{N} \sum_{i=1}^N \left( \frac{T_i Y_i}{p_i} - \frac{(1 - T_i) Y_i}{1 - p_i} \right) \]

  • Properties:

    • The IPW estimator is unbiased for the ATE: \({\mathbb{E}}[\widehat{\tau}_{IPW}] = \tau_{ATE}\)
    • Can be used for estimation of SATE or PATE.
    • Can be obtained using regression with inverse probability weights (e.g. estimatr::lm_robust() with weights argument).
  • Intuition: Weight each unit by the inverse of its probability of receiving the observed treatment to compensate for over-weighting.

  • Note: The IPW estimator is unbiased and consistent if the treatment assignment mechanism is known and correctly specified.

Example: Bias of DiM vs IPW Estimator
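A minimal base-R simulation can make the contrast concrete (a sketch, not from the original slides; the two-stratum setup and all parameter values are illustrative assumptions):

```r
# Sketch: two strata with different assignment probabilities, where the
# high-probability stratum also has higher Y(1). DiM is biased; IPW is not.
set.seed(123)

N <- 1000
p <- c(rep(0.2, N / 2), rep(0.8, N / 2))  # assignment probabilities
Y1 <- c(rnorm(N / 2, 1), rnorm(N / 2, 3)) # Y(1) higher where p is higher
Y0 <- rep(0, N)                           # Y(0) fixed at zero
tau_true <- mean(Y1 - Y0)

one_rep <- function() {
  Ti <- rbinom(N, 1, p)
  Y <- Ti * Y1 + (1 - Ti) * Y0
  c(
    dim = mean(Y[Ti == 1]) - mean(Y[Ti == 0]),
    ipw = mean(Ti * Y / p - (1 - Ti) * Y / (1 - p))
  )
}

res <- replicate(2000, one_rep())
rowMeans(res) # DiM overweights the high-p, high-outcome stratum; IPW ~ tau_true
```

In this setup the DiM average is pulled toward the high-probability stratum (around 2.6 here versus a true ATE of about 2), while the IPW average sits on the true ATE.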

Cluster Randomized Experiments

Cluster Randomization

  • So far, we have assumed treatments are assigned at the individual level.

  • Sometimes random assignment occurs at the cluster level for various reasons:

    • Treatment only makes sense at the group level, but the outcome is measured for individuals.

    • Treatment too costly to implement individually.

    • SUTVA only plausible if treatment is defined at the group level.

    • Example: Effect of classroom teaching method on student performance.

  • Standard errors ignoring cluster randomization are usually too small (opposite of conservative).

  • This is because units within the same cluster are typically more similar to each other than to units in different clusters.

Warning

“Analyses of group randomized trials that ignore clustering are an exercise in self-deception.” (Cornfield 1978)

Randomization at the Group Level

Bias of DiM in Cluster-Randomized Experiments


  • When clusters have unequal sizes, the DiM estimator can produce biased estimates of \(\tau_{ATE}\) if the cluster sizes are correlated with the potential outcomes.

    • Example: If larger clusters tend to have higher potential outcomes, the DiM estimator will overestimate the treatment effect.
  • Intuition: The DiM estimator gives equal weight to each cluster, regardless of size, which can lead to biased estimates if cluster sizes are not balanced.

  • To address the issue, as before, we can:

    • Use weighted estimators, such as the IPW estimator, or
    • Regression weighted by inverse of probabilities of assignment to account for unequal cluster sizes.

Intracluster Correlation


  • Recall the Law of Total Variance: \({\mathbb{V}}(Y) = {\mathbb{E}}[{\mathbb{V}}(Y\mid X)] + {\mathbb{V}}({\mathbb{E}}[Y\mid X])\)

  • This implies the decomposition of heterogeneity in outcomes:

\[ \underbrace{\sum_{j=1}^G \sum_{i=1}^{N_j} (Y_{ij} - \overline{Y})^2}_{\text{overall variance, } \sigma^2} = \underbrace{\sum_{j=1}^G \sum_{i=1}^{N_j} (Y_{ij} - \overline{Y}_{j})^2}_{\text{within-cluster variance, } \sigma^2_W} + \underbrace{\sum_{j=1}^G N_{j}(\overline{Y}_{j} - \overline{Y})^2,}_{\text{between-cluster variance, } \sigma^2_B} \]

where \(\overline{Y}_{j}\) is the mean of \(Y_{ij}\) in cluster \(j\) and \(\overline{Y}\) is the mean over all \(Y_{ij}\).

  • Then we can define the intracluster correlation: \(\rho = \frac{\sigma^2_B}{\sigma^2} = 1 - \frac{\sigma^2_W}{\sigma^2}\)

  • Intuition: When \(\rho\) is \(1\) (\(0\)), responses are identical (uncorrelated) within each cluster.
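The decomposition above translates directly into code (a sketch; the helper `icc()` is an assumed name, not from the slides):

```r
# Sketch: intracluster correlation as the share of between-cluster
# variation, rho = SS_between / SS_total
icc <- function(y, cluster) {
  ybar <- mean(y)
  ybar_j <- tapply(y, cluster, mean)   # cluster means
  n_j <- tapply(y, cluster, length)    # cluster sizes
  ss_total <- sum((y - ybar)^2)
  ss_between <- sum(n_j * (ybar_j - ybar)^2)
  ss_between / ss_total
}

icc(rep(c(0, 5), each = 10), rep(1:2, each = 10)) # identical within clusters: 1
icc(c(0, 5, 0, 5), c(1, 1, 2, 2))                 # all variation within clusters: 0
```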

Inference in Cluster-Randomized Experiments

  • We can show that cluster randomization inflates the sampling variance (compared to complete randomization) approximately by the Moulton factor (design effect):

\[ \frac{{\mathbb{V}}(\hat\tau_{CL})}{{\mathbb{V}}(\hat\tau_{R})} = 1 + (\overline{N} - 1)\rho, \quad \text{where} \quad \overline{N} = \frac{1}{G} \sum_{j=1}^G N_j \]

  • Intuition: When \(\rho = 1\), outcomes do not vary within clusters (\(\sigma^2_{W} = 0\)) and each cluster is essentially one observation. As a result, the effective sample size is the number of clusters, not the number of units, and the variance is inflated. When \(\rho = 0\), each cluster looks like a random subsample of the whole, and there is no inflation.
  • Valid inference:

    • OLS and DiM estimators are unbiased for \(\tau_{ATE}\) if clusters are equally sized.
    • Possible bias if cluster sizes vary and are correlated with potential outcomes.
    • Rely on cluster-robust standard errors (CR2) or randomization inference.
  • Note: When \(G\) is small, \(\rho\) will be poorly estimated and cluster SEs will be unreliable \(\rightsquigarrow\) prefer increasing \(G\) over sample size per cluster (\(N_j\)).
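The Moulton approximation above makes for an easy back-of-the-envelope check (a sketch; the \(\rho\) and cluster-size values are illustrative assumptions):

```r
# Variance inflation from cluster randomization: 1 + (Nbar - 1) * rho
design_effect <- function(rho, n_bar) 1 + (n_bar - 1) * rho

design_effect(rho = 0.05, n_bar = 20)  # modest ICC, 20 per cluster: 1.95
design_effect(rho = 0.05, n_bar = 100) # larger clusters hurt more: 5.95
```

Even a small \(\rho\) nearly doubles the variance with 20 units per cluster, which is why adding clusters usually beats adding units within clusters.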

Simulation: Ignoring Clustering in SE Estimates

  • Ignoring clustering in standard error estimates can result in overly “optimistic” confidence intervals and increased Type I error (false-positives) rates.

  • The simulation shows how coverage (the probability that the 95% CI includes the true ATE across replications of the experiment) is affected by the ratio of between- to within-cluster variance:

    • We would expect these CIs to include the true ATE in 95% of cases.
    • HC2 are CIs based on heteroskedasticity-robust standard errors.
    • CR2 are CIs based on cluster-robust standard errors.


Example: Field Experiment in Benin





Wantchekon (2003)


Block Randomized Experiments

Pre- vs. Post-treatment Adjustment

  • We discussed pros/cons for covariate adjustment after randomization

  • But, why not do adjustment before randomization?

  • Basic idea: If you have data on pre-treatment characteristics \(X_i\), why leave it to pure chance to balance them?

  • Example: \(n = 4\) with two males and two females.

    • Complete randomization will place two females in the same treatment group \(\frac{1}{3}\) of the time.

    • If that happens, how can we tell the treatment effect from gender difference?

  • Solution: Pre-stratify the sample, and then randomize completely within each stratum

    1. Blocking will perfectly balance \(X_i\).
    2. Randomization will balance the rest in expectation.

    Note

    \(\rightsquigarrow\) “Block what you can; randomize what you cannot.” (George Box)

Simple Two Block Example

  • In GOTV experiment, what if we have previous turnout data from the voter file?

    • Create blocks: \(V_i = 1\) if voted in last election, \(V_i = 0\) otherwise.
    • \(N_\text{v}\) is the number of previous voters.
    • \(N_\text{nv} = N - N_\text{v}\) is the number of previous nonvoters.
  • SATE within blocks is defined by \(V_i\):

\[ \tau_\text{v} = \frac{1}{N_\text{v}} \sum_{i:V_i=1} [ Y_i(1) - Y_i(0) ], \qquad \tau_\text{nv} = \frac{1}{N_\text{nv}} \sum_{i:V_i=0} [ Y_i(1) - Y_i(0) ] \]

  • Using Law of Iterated Expectation:

\[ \tau_{SATE} = \underbrace{\left( \frac{N_\text{v}}{N_\text{v} + N_\text{nv}} \right)}_{\text{share voters}} \tau_\text{v} + \underbrace{\left( \frac{N_\text{nv}}{N_\text{v} + N_\text{nv}} \right)}_{\text{share non-voters}} \tau_\text{nv} \]

Block Randomized Design

  • Block (stratified) randomized experiment:

    • Each block is essentially completely randomized experiment.
    • Choose \(N_{1,\text{v}}\) voters to be treated, \(N_{0,\text{v}} = N_{\text{v}} − N_{1,\text{v}}\) control.
    • Choose \(N_{1,\text{nv}}\) non-voters to be treated, \(N_{0,\text{nv}} = N_{\text{nv}} − N_{1,\text{nv}}\) control.
  • Probability of treatment in each group called the propensity score:

    • Prob. of treatment for voters: \({\textrm{Pr}}(T_i = 1 {\:\vert\:}V_i = 1) = p_{\text{v}} = \frac{N_{1,\text{v}}}{N_{\text{v}}}\).
    • Prob. of treatment for non-voters: \({\textrm{Pr}}(T_i = 1 {\:\vert\:}V_i = 0) = p_{\text{nv}} = \frac{N_{1,\text{nv}}}{N_{\text{nv}}}\).
  • Blocking ensures balance across blocks:

    • When \(p_{\text{v}} = p_{\text{nv}}\), distribution of treatment is exactly the same in each block.
    • With complete randomization, treatment might be very imbalanced (in absolute terms) across \(V_i\).
    • Benefit: No possibility of chance imbalances skewing the estimates.

Estimators in Blocked Designs

  • Within-strata DiM’s are:

\[ \begin{align*} \widehat{\tau}_\text{v} &= \overline{Y}_{1,\text{v}} - \overline{Y}_{0,\text{v}} = \frac{1}{N_{1,\text{v}}} \sum_{i:V_i=1} T_i Y_i - \frac{1}{N_{0,\text{v}}} \sum_{i:V_i=1} (1 - T_i) Y_i \\ \widehat{\tau}_\text{nv} &= \overline{Y}_{1,\text{nv}} - \overline{Y}_{0,\text{nv}} = \frac{1}{N_{1,\text{nv}}} \sum_{i:V_i=0} T_i Y_i - \frac{1}{N_{0,\text{nv}}} \sum_{i:V_i=0} (1 - T_i) Y_i \end{align*} \]

  • Property: Unbiased for the within-strata SATE’s: \({\mathbb{E}}[\widehat{\tau}_\text{v} {\:\vert\:}\mathcal{O}] = \tau_\text{v}\).
  • Unbiased estimator for the overall SATE for block design:

\[ \widehat{\tau}_{BR} = \left(\frac{N_\text{v}}{N}\right) \widehat{\tau}_\text{v} + \left(\frac{N_\text{nv}}{N}\right) \widehat{\tau}_\text{nv} \]

  • Property: Equivalent to the regular DiM if \(p_\text{v} = p_\text{nv}\) (e.g., both equal to \(\frac{1}{2}\)).
  • Otherwise, standard \(\widehat{\tau}_{DiM}\) under block design will be biased.

Sampling Variance of Blocking Estimator

  • Each block is a completely randomized experiment so we have:

\[ {\mathbb{V}}[\widehat{\tau}_\text{v} {\:\vert\:}\mathcal{O}] = \frac{S^2_{1,\text{v}}}{N_{1,\text{v}}} + \frac{S^2_{0,\text{v}}}{N_{0,\text{v}}} - \frac{S^2_{\tau,\text{v}}}{N_{\text{v}}}, \]

where \(S^2_{1,\text{v}}\), \(S^2_{0,\text{v}}\), and \(S^2_{\tau,\text{v}}\) are the within-block sample variances of the treated and control potential outcomes and of the unit-level effects \(\tau_i\).

  • Finite sample variance of the blocked (BR) estimator:

\[ {\mathbb{V}}[\widehat{\tau}_{BR} {\:\vert\:}\mathcal{O}] = \left(\frac{N_\text{v}}{N}\right)^2 {\mathbb{V}}[\widehat{\tau}_\text{v} {\:\vert\:}\mathcal{O}] + \left(\frac{N_\text{nv}}{N}\right)^2 {\mathbb{V}}[\widehat{\tau}_\text{nv} {\:\vert\:}\mathcal{O}] \]

  • Use the conservative variance estimators from each strata to get:

\[ \widehat{\sigma}_{BR} = \left(\frac{N_\text{v}}{N}\right)^2 \left(\frac{\widehat{\sigma}^2_{1,\text{v}}}{N_{1,\text{v}}} + \frac{\widehat{\sigma}^2_{0,\text{v}}}{N_{0,\text{v}}}\right) + \left(\frac{N_\text{nv}}{N}\right)^2 \left(\frac{\widehat{\sigma}^2_{1,\text{nv}}}{N_{1,\text{nv}}} + \frac{\widehat{\sigma}^2_{0,\text{nv}}}{N_{0,\text{nv}}}\right), \]

where \(\widehat{\sigma}^2_{t,\text{v}}\) are the within-strata observed outcome variances.
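The point estimate and conservative variance above can be assembled in a few lines of base R (a sketch; the helper name and toy data are assumptions, and each block needs at least two treated and two control units for the sample variances):

```r
# Sketch: blocked DiM and its conservative SE, aggregating the
# within-block quantities with weights N_j / N as in the formulas above.
blocked_estimate <- function(y, t, block) {
  stats <- lapply(split(data.frame(y = y, t = t), block), function(d) {
    n1 <- sum(d$t)
    n0 <- sum(1 - d$t)
    tau_j <- mean(d$y[d$t == 1]) - mean(d$y[d$t == 0])
    var_j <- var(d$y[d$t == 1]) / n1 + var(d$y[d$t == 0]) / n0
    c(n = n1 + n0, tau = tau_j, var = var_j)
  })
  m <- do.call(rbind, stats)
  w <- m[, "n"] / sum(m[, "n"])
  c(est = sum(w * m[, "tau"]), se = sqrt(sum(w^2 * m[, "var"])))
}

# toy check: two equal-sized blocks with within-block effects 3 and 2
blocked_estimate(
  y = c(3, 5, 1, 1, 2, 4, 1, 1),
  t = c(1, 1, 0, 0, 1, 1, 0, 0),
  block = rep(c("A", "B"), each = 4)
) # est = 2.5
```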

General Blocking Notation

  • Blocks, \(j \in \{1, \dots, J\}\).

    • Block indicator \(B_i = j\) if \(i\) is in block \(j\).
    • Sizes: \(N_j > 2\) and proportions \(w_j = N_j / N\).
    • Number treated in each block: \(N_{1,j}\) and \(N_{0,j} = N_j - N_{1,j}\).
  • Within-block estimators:

\[ \begin{align*} \widehat{\tau}_j &= \frac{1}{N_{1,j}} \sum_{i:B_i=j} T_i Y_i - \frac{1}{N_{0,j}} \sum_{i:B_i=j} (1 - T_i) Y_i, \\ \widehat{\sigma}_j &= \frac{\widehat{\sigma}^2_{1,j}}{N_{1,j}} + \frac{\widehat{\sigma}^2_{0,j}}{N_{0,j}} \end{align*} \]

  • Aggregate blocking estimators:

\[ \widehat{\tau}_{BR} = \sum_{j} w_j \widehat{\tau}_j, \qquad \widehat{\sigma}_{BR} = \sum_{j} w_j^2 \widehat{\sigma}_j \]

Efficiency of Blocking

  • Efficiency of block versus complete randomization (R) depends on the sampling scheme.

    • Usually, blocking will be more efficient (lower variance), but not always.
  • The finite sample difference in sampling variances is \(\widehat{\sigma}_{R} - \widehat{\sigma}_{BR} = \frac{1}{N-1} (B - W)\), where \(B\) and \(W\) measure between- and within-block variation:

\[ \begin{align*} B &= \sum_{j=1}^{J} \left(\frac{N_j}{N}\right) (\overline{Y}_j(1) + \overline{Y}_j(0) - (\overline{Y}(1) + \overline{Y}(0)))^2 \\ W &= \sum_{j=1}^{J} \frac{N_j}{N} \frac{N_{1,j} N_{0,j}}{N_j} \widehat{\sigma}(\widehat{\tau}_j {\:\vert\:}\mathcal{O}) \end{align*} \]

  • Difference can be positive or negative (Pashley and Miratrix 2022):

    • Intuition: Blocking is better when outcomes vary a lot across blocks, not much within blocks (blocks are predictive of outcomes, so usually the case).
    • Blocking is also more efficient for PATE under stratified sampling.

How to Block



  • Discrete covariates \(\rightsquigarrow\) blocks by unique combinations.

  • Alternative: create blocks by forming homogeneous groups in \(\mathbf{X}\).

    • Choose a distance metric, such as the Mahalanobis distance:

    \[ M(\mathbf{X}_i, \mathbf{X}_k) = \sqrt{(\mathbf{X}_i - \mathbf{X}_k)' \hat{V}(\mathbf{X})^{-1} (\mathbf{X}_i - \mathbf{X}_k)} \]

  • Challenges:

    • Difficult/impossible to find optimal blocks in general, but “greedy” algorithms exist.
    • Possible to get optimal blocks with pair matching (\(J = n/2\)).
    • R packages optmatch and blockTools implement these matching procedures.

Example: Forming Blocks using blockTools


pacman::p_load(
  blockTools, randomizr, RItools
)

set.seed(20250211)

# simulate some data
N <- 100
data <- data.frame(
  id = 1:N,
  female = sample(c(0, 1), N, replace = TRUE),
  age = round(truncnorm::rtruncnorm(N, a = 18, b = 80, mean = 30, sd = 10)),
  education = sample(1:4, N, replace = TRUE) # 1: High School, 2: College, etc.
)

# form blocks using gender and age
blocks <- block(
  data,
  id.vars = "id",
  groups = "female",
  n.tr = 2,
  block.vars = c("age"),
  distance = "mahalanobis"
)

# add block ids and random assignment
data <- 
  data |> 
  dplyr::mutate(
    block_id = 
      blockTools::createBlockIDs(
        obj = blocks, data = data, id.var = "id"),
    treat1 = 
      randomizr::complete_ra(N = n()),
    treat2 = 
      randomizr::block_ra(
        blocks = female, 
        prob = 0.5),
    treat3 = 
      randomizr::block_ra(
        blocks = block_id, 
        prob = 0.5)
  )

# output balance tests
out <-
lapply(
  1:3,
  function(x) {
    RItools::xBalance(
    fmla = as.formula(paste0("treat", x, "~ education + female + age")),
    data = data,
    report = c("adj.means", "std.diffs", "p.values"))$results[,,1] |> 
      knitr::kable(
        digits = 3, align = "cccc",
        caption = c("Complete RA", "Block by female", "Block by female and age")[x]) |> 
      kableExtra::kable_minimal(font_size = 20)
  }
)

out[[1]]; out[[2]]; out[[3]]
Complete RA

|           | Control | Treatment | std.diff |   p   |
|:----------|:-------:|:---------:|:--------:|:-----:|
| education |  2.58   |   2.22    |  -0.327  | 0.105 |
| female    |  0.42   |   0.64    |  0.447   | 0.028 |
| age       |  32.84  |   32.16   |  -0.086  | 0.664 |

Block by female

|           | Control | Treatment | std.diff |   p   |
|:----------|:-------:|:---------:|:--------:|:-----:|
| education | 2.569   |  2.224    |  -0.312  | 0.121 |
| female    | 0.529   |  0.531    |   0.002  | 0.990 |
| age       | 31.039  |  34.020   |   0.386  | 0.057 |

Block by female and age

|           | Control | Treatment | std.diff |   p   |
|:----------|:-------:|:---------:|:--------:|:-----:|
| education |  2.54   |   2.26    |  -0.253  | 0.207 |
| female    |  0.52   |   0.54    |   0.040  | 0.842 |
| age       |  32.88  |   32.12   |  -0.097  | 0.627 |

Regression for Block Randomized Experiments

  • Like in the classical experiment, one can use linear regression to obtain unbiased estimates in block-randomized experiments.
  • \(p_j = p\): Use OLS with block dummies (or fixed effects) to get an unbiased estimate of the ATE:

\[ Y_i = \alpha + \tau T_i + \sum_{j=2}^{J} \beta_j B_{ij} + \epsilon_i, \quad \text{where} \quad {\mathbb{E}}[\widehat{\tau}_{OLS}] = \tau \]

  • Valid uncertainty estimates can then be obtained via HC2 standard errors (or clustered SE if randomization was clustered within blocks).
  • \(p_j\) varies by block: Use weighted least squares instead of OLS, where the weight is the inverse probability of treatment/control for \(i\) in block \(j\):

\[ \forall i, j:\: w_{ij} = \begin{cases} \frac{1}{p_j}, & \text{if } T_i = 1 \\ \frac{1}{(1 - p_j)}, & \text{if } T_i = 0 \end{cases} \]
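A sketch of the weighted fix (the simulated data and all parameter values are assumptions; the slides recommend estimatr::lm_robust for valid SEs, so base R's lm() here only illustrates the weighting):

```r
# Sketch: WLS with inverse-probability weights when the propensity
# score p_j varies by block.
set.seed(20250211)
block <- rep(c("A", "B"), each = 500)
p_j <- ifelse(block == "A", 0.2, 0.6)          # block propensity scores
Ti <- rbinom(1000, 1, p_j)
Yi <- 1 + 2 * Ti + (block == "B") + rnorm(1000) # true effect = 2

w <- ifelse(Ti == 1, 1 / p_j, 1 / (1 - p_j))   # weights from the display above
coef(lm(Yi ~ Ti, weights = w))["Ti"]           # close to 2
```

Without the weights, treated units over-represent the high-propensity block B (which also has a higher baseline), so the unweighted difference-in-means is biased upward.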

Statistical Power

What is Power?

  • Recall that for a statistical test:

    • Type 1 error (false-positives): Rejecting the null if the null is true (\(\alpha\))
    • Type 2 error (false-negatives): Not rejecting the null if the null hypothesis is false (\(\psi\))
  • Statistical power of a design + test \(\equiv\) the test’s probability of rejecting the null in favor of the alternative when the alternative is indeed true, given the design.

  • Example: \(H_0:\: \beta = 0\) and \(H_a:\: |\beta| = \tau > 0\), giving rise to a two-sided test. Then power is given by \(\kappa = {\textrm{Pr}}[\text{Reject } H_0 {\:\vert\:}H_a \text{ is true; design, test}]\).

  • What does power depend on?

    • True size of the effect (\(\tau\)),
    • Sample size and proportion of the treated (\(N\) and \(p\)),
    • Variability of potential outcomes (\(\sigma\)),
    • Test statistic,
    • Number of treatments,
    • Randomization scheme (simple, complete, clustered, blocked, etc.)

Choice of Relative Treatment Group Size (\(p\))


  • Consider a randomized experiment with complete randomization.

  • Problem: For a given total sample size \(N\), choose the optimal treatment allocation \(p = N_1/N\) to minimize the variance of the estimator of the average treatment effect.

  • Recall that our asymptotically valid variance expression is:

\[ {\mathbb{V}}(\hat\tau) = \frac{\sigma^2_1}{p N}+\frac{\sigma^2_0}{(1-p) N} \]

  • How should we proceed? Solve a minimization problem!

Choice of Relative Treatment Group Size (\(p\))


  • Find the value \(p^*\) that makes the derivative with respect to \(p\) equal to zero:

\[ -\frac{\sigma^2_1}{p^{*2} N}+\frac{\sigma^2_0}{(1-p^*)^2 N}=0 \]

Therefore:

\[ \frac{1-p^*}{p^*} = \frac{\sigma_0}{\sigma_1} \implies p^* = \frac{\sigma_1}{\sigma_1+\sigma_0}=\frac{1}{1+\sigma_0/\sigma_1} \]

  • Intuition: A “rule of thumb” if you can assume \(\sigma_1\approx \sigma_0\) is to have \(p^{*}=0.5\)

  • For practical reasons it is sometimes better to choose unequal sample sizes (even if \(\sigma_1\approx \sigma_0\)).
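A quick numeric check of the optimum (a sketch; the \(\sigma\) values are assumptions): the analytic \(p^*\) matches a grid search over the variance formula.

```r
# Variance of DiM as a function of p, from the expression above
v_tau <- function(p, N, s1, s0) s1^2 / (p * N) + s0^2 / ((1 - p) * N)

s1 <- 2; s0 <- 1
p_star <- s1 / (s1 + s0)                      # analytic optimum: 2/3
grid <- seq(0.01, 0.99, by = 0.001)
grid[which.min(v_tau(grid, N = 100, s1, s0))] # agrees with p_star
```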

Sampling Variance as Function of \(p\)

  • Suppose: \(\sigma^2_1=\sigma^2_0=1\), \(N=100\)
# function to calculate variance of DiM
variance_dim <- 
  function(p, N, sigma1 = 1, sigma0 = 1) {
    (sigma1^2 / (p * N)) + (sigma0^2 / ((1 - p) * N))
  }

# create a sequence of assignment p's
# calculate variance for each p
variance_data <- 
  tibble(
    N = 100,
    p = seq(0.01, 0.99, by = 0.01), 
    variance = 
      map2_dbl(
        N, p, 
        \(x, y) variance_dim(p = y, N = x)))

# plot using ggplot2
ggplot(variance_data, aes(x = p, y = variance)) +
  geom_line(color = "#458588", linewidth = 1) + # `linewidth` replaces `size` in ggplot2 >= 3.4
  labs(
    x = bquote("Proportion of Treatment Group, " ~ N[1]/N),
    y = bquote("Variance of DiM, " ~ V(bar(Y)(1) - bar(Y)(0)))
  ) + 
    theme(text = element_text(size = 16))

Choice of Overall Sample Size (\(N\))


  • Suppose that \(\sigma^2_0=\sigma^2_1\) and \(Y_i (0) \sim (\mu_0, \sigma^2)\) and \(Y_i (1) \sim (\mu_1, \sigma^2)\)

  • Assume also that \(p=0.5\), so \(N_0=N_1=N/2\), and \(\tau=\mu_1-\mu_0\).

  • Then, for the \(t\)-statistic of equality of means, we have:

\[ \frac{\widehat{\tau}_{DiM}-\tau}{\sqrt{\frac{\sigma^2_1}{N_1}+\frac{\sigma^2_0}{N_0}}} = \frac{\widehat{\tau}_{DiM}-\tau}{\sqrt{\frac{2\sigma^2}{N}+\frac{2\sigma^2}{N}}} = \frac{\widehat{\tau}_{DiM}-\tau}{2\sigma/\sqrt{N}} \sim \mathcal{N} (0,1). \]

  • Therefore:

\[ t = \frac{\widehat{\tau}_{DiM}}{\sqrt{\frac{\sigma^2_1}{N_1}+\frac{\sigma^2_0}{N_0}}} \sim \mathcal{N}\left(\frac{\tau\sqrt{N}}{2\sigma},1\right) \]

Choice of Overall Sample Size (\(N\))



  • The power, i.e. \({\textrm{Pr}}\left(\text{Reject } \mu_1 - \mu_0 = 0 {\:\vert\:}\mu_1 - \mu_0 = \tau \right)\) is then given by:

\[ \begin{align*} {\textrm{Pr}}\left(|t| > 1.96\right) &= {\textrm{Pr}}\left(t < -1.96\right) + {\textrm{Pr}}\left(t > 1.96\right) \\ &= {\textrm{Pr}}\left(t-\frac{\tau \sqrt N}{2\sigma} < -1.96 - \frac{\tau \sqrt N}{2\sigma}\right) \\ &\qquad + {\textrm{Pr}}\left(t-\frac{\tau \sqrt N}{2\sigma}>1.96-\frac{\tau \sqrt N}{2\sigma}\right) \\ &= \Phi\left(-1.96-\frac{\tau \sqrt N}{2\sigma}\right) + \left(1-\Phi\left(1.96-\frac{\tau \sqrt N}{2\sigma}\right)\right) \end{align*} \]

General Formula for the Power Function


\[ \begin{align*} {\textrm{Pr}}(\text{reject } \mu_1-\mu_0=0 &| \mu_1-\mu_0=\tau) = \\ & \Phi\left(-1.96-\tau\Bigg/\sqrt{\frac{\sigma_1^2}{p N}+\frac{\sigma_0^2}{(1-p)N}}\right) \\ & \qquad + \left(1-\Phi\left(1.96-\tau\Bigg/\sqrt{\frac{\sigma_1^2}{p N}+\frac{\sigma_0^2}{(1-p)N}}\right)\right) \end{align*} \]

  • To choose \(N\) we need to specify:

    1. \(\tau\): usually \(0.25 \sigma_0\),
    2. Target power value (\(1- \psi\)): usually 0.80 or higher,
    3. \(\sigma_1^2\) and \(\sigma_0^2\): e.g. using previous measures,
    4. \(p\): proportion of observations in the treatment group
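The general power formula above, written as a function (a sketch; the function name and defaults are assumptions):

```r
# Power of a two-sided z-test for the DiM, from the formula above
power_dim <- function(tau, N, p = 0.5, s1 = 1, s0 = 1, alpha = 0.05) {
  se <- sqrt(s1^2 / (p * N) + s0^2 / ((1 - p) * N))
  z <- qnorm(1 - alpha / 2)
  pnorm(-z - tau / se) + (1 - pnorm(z - tau / se))
}

power_dim(tau = 0, N = 100)   # equals alpha = 0.05 under the null
power_dim(tau = 0.5, N = 200) # grows with N and tau
```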

Power Functions



  • Power calculations for sample size assuming

    • \(\sigma^2 = \sigma_1^2 = \sigma_0^2 = 1\),
    • \(p=0.5\) and complete random assignment.
  • Intuition: A larger sample size allows us to detect smaller effect sizes, but increasing the sample size has diminishing returns for precision.

Minimum Detectable Effect

  • Testing \(H_0 : \beta = 0\) relative to \(H_a : |\beta| = \mu > 0\).

  • Large-sample distribution of \(\hat{\beta}\) under \(H_0\), denoted by \(F_{\hat{\beta}}\bigl(\cdot \mid \beta = 0\bigr)\).

  • For a test at \(100\bigl(1 - \alpha\bigr)\%\) confidence, the distribution under \(H_0\) defines the “rejection region.”

Minimum Detectable Effect

  • Large-sample distribution of \(\hat{\beta}\) under \(H_a\), denoted by \(F_{\hat{\beta}}\bigl(\cdot {\:\vert\:}\beta = b_{H_a}\bigr)\).

  • Power is the probability of falling in the “rejection region”:

    \[ \kappa = 1 - F_{\hat{\beta}}\bigl(t_{\alpha/2} \sigma_{\hat{\beta}} \mid \beta = b_{H_A}\bigr) \]

  • For a test with power \(\kappa\), need \(b_{H_a}\) as in the picture.

Minimum Detectable Effect


\[ \left| \beta_{\alpha, \kappa, \sigma_{\hat{\beta}}^{(0)}, \sigma_{\hat{\beta}}^{(a)}} \right| = t_{\alpha/2} \sigma_{\hat{\beta}}^{(0)} + t_{1-\kappa} \sigma_{\hat{\beta}}^{(a)} \]

where,

  • \(\alpha\) is the type I error rate (false-positives rate),
  • \(\kappa\) is the power of the test (\(1 -\) false-negatives rate),
  • \(\sigma_{\hat{\beta}}^{(h)}\) is the standard error for \(\hat{\beta}\) (i.e., \(\sqrt{{\mathbb{V}}[\hat{\beta}]}\)), under hypothesis \(h\), with \(\sigma_{\hat{\beta}}^{(h)}\) a function of sample size (\(N\) and \(p\)),
  • \(t_{\alpha/2}\) and \(t_{1-\kappa}\) are the absolute values of the \(\alpha/2\) and \(1 - \kappa\) quantiles of the reference distribution (e.g., large sample distribution of \((\hat{\beta} - \beta)/\sigma_{\hat{\beta}}^{(h)}\)).

Minimum Detectable Effect

  • Let \(\sigma_{\hat{\beta}} = \max\bigl(\sigma_{\hat{\beta}}^{(0)}, \sigma_{\hat{\beta}}^{(a)}\bigr)\).

  • Then define a conservative MDE as

RESULT: Minimum Detectable Effect

\[ \underbrace{\left| \beta_{\alpha, \kappa, \sigma_{\hat{\beta}}} \right|}_{MDE} = \bigl(t_{\alpha/2} + t_{1-\kappa} \bigr) \sigma_{\hat{\beta}}. \]

  • Suppose we test against \(\mathcal{N} (0,1)\) with \(\alpha = 0.05\) and \(\kappa = 0.80\), the standards in social science. Then we have \(t_{\alpha/2} + t_{1-\kappa} = |z_{.025}| + |z_{.2}| = 1.96 + 0.84 = 2.80\).

  • Therefore, the MDE is 2.8 times the conservative standard error of the effect estimator for any study for which we use

    1. a two-sided test,
    2. a standard normal distribution as our reference distribution,
    3. \(\alpha = .05\) and \(\kappa = .80\)
  • In all applications we have studied, the normal or \(t\) distribution is a good approximation so long as the sample size is not too small.

Minimum Detectable Effect

  • Under complete random assignment we have,

\[ {\mathbb{V}}[\hat{\beta}] = \frac{1}{N} \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right) \]

  • Plug into the expression for MDE to get sample size determination formula.

\[ \begin{align*} MDE &= \bigl(t_{\alpha/2} + t_{1-\kappa} \bigr) \sqrt{\frac{1}{N} \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right)} \\ \implies N &= \frac{\bigl(t_{\alpha/2} + t_{1-\kappa} \bigr)^2 \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right)}{MDE^2}. \end{align*} \]

  • Note: Use standardized effect sizes to avoid the “power fallacy”!

    • E.g., Effect size relative to the control group standard deviation, Glass’s \(\Delta\), \(\Delta = MDE / \sigma_{0}\) (normalize standard deviations too!)
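Inverting the MDE expression gives a small sample-size calculator (a sketch; the function name and defaults are assumptions):

```r
# Required N from the sample size determination formula above
n_required <- function(mde, p = 0.5, s1 = 1, s0 = 1,
                       alpha = 0.05, kappa = 0.80) {
  m <- qnorm(1 - alpha / 2) + qnorm(kappa) # t_{alpha/2} + t_{1-kappa} ~ 2.80
  ceiling(m^2 * (s1^2 / p + s0^2 / (1 - p)) / mde^2)
}

n_required(mde = 0.25) # a quarter-SD effect needs about 500 units
```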

Conclusion on Power


  • Statistical power is the probability of rejecting the null hypothesis when it is indeed false.

    • It depends on design and test!
  • Optimal design considerations:

    • Balance treatment and control groups (\(p \approx 0.5\)) if you do not expect differences in potential outcomes variability.
    • Increase sample size to detect smaller effects.
    • Use blocking or stratification to improve precision.
  • Practical considerations:

    • Ensure sufficient power to detect meaningful effects.
    • Consider trade-offs between sample size and costs/harm.
    • Use power analysis to plan experiments effectively (e.g. EGAP’s power calculators).
  • Further reading: Blair, Coppock, and Humphreys (2023, Ch. 10-11).

Appendix

🔙 Why Demean? Orthogonality


  • In OLS, adding a regressor orthogonal to existing regressors doesn’t change their coefficients–it only reduces residual variance \(\Rightarrow\) smaller SEs.

  • In a randomized experiment, \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\) by design, so \({\mathrm{corr}}(T_i, X_i) \approx 0\). Great!

  • But for the interaction \(T_i X_i\) (without demeaning):

\[ \begin{align*} {\mathrm{cov}}(T_i, T_i X_i) &= {\mathbb{E}}[T_i^2 X_i] - {\mathbb{E}}[T_i] {\mathbb{E}}[T_i X_i] \\ &= \bigl({\mathbb{E}}[T_i^2] - {\mathbb{E}}[T_i]^2\bigr) {\mathbb{E}}[X_i] \neq 0 \;\text{ if } {\mathbb{E}}[X_i] \neq 0 \end{align*} \]

So \(T_i\) and \(T_i X_i\) are correlated \(\Rightarrow\) \(\widehat{\tau}\) estimates the effect at \(X = 0\), not the ATE.

  • With demeaning: \({\mathbb{E}}[\tilde{X}_i] = 0 \implies {\mathrm{cov}}(T_i, T_i \tilde{X}_i) = 0\) by construction.

  • Orthogonality restored \(\Rightarrow\) \(\widehat{\tau}_{Lin}\) is shielded from the interaction terms.

🔙 Proof of \({\mathbb{E}}[ \widehat{\tau}_{IPW} {\:\vert\:}\mathcal{O}_N ] = \tau_{SATE}\)



\[ \begin{aligned} &{\mathbb{E}}[ \widehat{\tau}_{IPW} {\:\vert\:}\mathcal{O}_N ] \\ &= {\mathbb{E}}\left[ \frac{1}{N} \sum_{i=1}^N \left\{\frac{T_iY_i}{p} - \frac{(1 - T_i) Y_i}{(1 - p)}\right\} \Bigg| \mathcal{O}_N \right] \\ &= \frac{1}{N} \sum_{i=1}^N \left\{ {\mathbb{E}}\left[ \frac{T_i Y_i (1)}{p} {\:\vert\:}\mathcal{O}_N \right] - {\mathbb{E}}\left[ \frac{(1 - T_i) Y_i (0)}{(1 - p)} {\:\vert\:}\mathcal{O}_N \right] \right\} \quad \text{($\because$ distribute ${\mathbb{E}}$/random assignment)}\\ &= \frac{1}{N} \sum_{i=1}^N \left\{ \frac{Y_i(1)}{p} {\mathbb{E}}[ T_i {\:\vert\:}\mathcal{O}_N ] - \frac{Y_i(0)}{1 - p} {\mathbb{E}}[ 1 - T_i {\:\vert\:}\mathcal{O}_N ] \right\} \quad \text{($\because$ POs are fixed)}\\ &= \frac{1}{N} \sum_{i=1}^N \left\{ \frac{Y_i (1)}{p} \cdot p - \frac{Y_i (0)}{1 - p} \cdot (1 - p) \right\} \quad \text{($\because$ definition of ${\mathbb{E}}$)}\\ &= \frac{1}{N} \sum_{i=1}^N Y_i(1) - Y_i(0) = \tau_{SATE} \end{aligned} \]

🔙 Proof of \({\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0 {\:\vert\:}\mathcal{O}_N) = -S_{10}/N\)



\[ \begin{aligned} {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0 {\:\vert\:}\mathcal{O}_N) &= \frac{1}{N_1 N_0}\sum_i\sum_j Y_i(1)\,Y_j(0)\;{\mathrm{cov}}(T_i,\;1-T_j) = -\frac{1}{N_1 N_0}\sum_i\sum_j Y_i(1)\,Y_j(0)\;{\mathrm{cov}}(T_i, T_j) \\[6pt] &\text{Under CR: } {\mathbb{V}}(T_i) = \frac{N_1 N_0}{N^2}, \quad {\mathrm{cov}}(T_i, T_j)\big|_{i \neq j} = -\frac{N_1 N_0}{N^2(N-1)} \\[6pt] &= \underbrace{-\frac{1}{N^2}\sum_i Y_i(1)Y_i(0)}_{\text{diagonal } (i=j)} \;+\; \underbrace{\frac{1}{N^2(N-1)}\sum_{i \neq j} Y_i(1)Y_j(0)}_{\text{off-diagonal}} \\[6pt] &= -\frac{1}{N^2}\sum_i Y_i(1)Y_i(0) + \frac{1}{N^2(N-1)}\left[N^2\,\overline{Y(1)}\,\overline{Y(0)} - \sum_i Y_i(1)Y_i(0)\right] \\[6pt] &= -\frac{\sum_i Y_i(1)Y_i(0)}{N^2}\cdot\frac{N}{N-1} + \frac{\overline{Y(1)}\,\overline{Y(0)}}{N-1} = -\frac{1}{N}\underbrace{\frac{1}{N-1}\left[\sum_i Y_i(1)Y_i(0) - N\,\overline{Y(1)}\,\overline{Y(0)}\right]}_{= S_{10}} = -\frac{S_{10}}{N} \end{aligned} \]

Cluster Robust SEs

  • Assume the following model: \(Y_i = \tau T_i + \varepsilon_i, {\mathbb{E}}[\varepsilon_i] = 0\) where

\[ {\mathbb{V}}[\hat \tau ] = {\mathbb{V}}\left [\sum_i T_i \varepsilon_i \right ] / \left ( \sum_i T_i^2 \right )^2 \]

  • If we assume:

    • \({\mathrm{cov}}[\varepsilon_i,\varepsilon_j] = 0\), \({\mathbb{V}}[\varepsilon_i] = \sigma^2\), then \({\mathbb{V}}[\hat \tau ] = \sigma^2 / \sum_i T_i^2\) (homoskedasticity)
    • \({\mathrm{cov}}[\varepsilon_i,\varepsilon_j] = 0\), then \({\mathbb{V}}_{\textrm{HC2}}[\hat \tau] = \left ( \sum_i T_i^2 \cdot {\mathbb{V}}[\varepsilon_i] \right ) / \left ( \sum_i T_i^2 \right )^2\) (heteroskedasticity)
    • \({\mathrm{cov}}[\varepsilon_i,\varepsilon_j] = 0\) unless observations \(i\) and \(j\) share the same cluster:

\[ {\mathbb{V}}_{\textrm{CR}}[\hat \tau] = \left (\sum_i \sum_j \textcolor{blue}{T_i T_j} \textcolor{red}{{\mathrm{cov}}[\varepsilon_i, \varepsilon_j]} \mathbb{1}[i,j \textrm{ in the same cluster}] \right) / \left(\sum_i T_i^2 \right )^2 \]

  • \({\mathbb{V}}_{\textrm{CR}}[\hat \tau] > {\mathbb{V}}_{\textrm{HC2}}[\hat \tau]\) if and only if \(T_i\) and \(T_j\) are correlated within clusters and \({\mathrm{cov}}[\varepsilon_i, \varepsilon_j] > 0\).

External Validity


  • When possible, a randomized experiment is the best approach for estimating causal effects

  • Identification is justified by design + Estimation & Inference are simple

  • High Internal Validity: we can estimate the SATE without bias, without making strong modeling assumptions

  • Common concern: External Validity

  • Egami and Hartman (2023) propose a framework organizing the systematic sources of external validity concerns:

    • \(X\)-validity: Generalizability of units.
    • \(T\)-validity: Generalizability of treatments.
    • \(Y\)-validity: Generalizability of outcomes.
    • \(C\)-validity: Generalizability of contexts.

Addressing External Validity

  • Statistical Adjustment: Employ covariate adjustment techniques such as regression adjustment, matching (e.g., propensity score matching), and stratification to address differences in the distributions of \(X\), \(T\), \(Y\), and \(C\) between the study sample and the target population.
  • Weighting Methods: Utilize inverse probability weighting (IPW) and propensity score weighting to reweight the study sample to mirror the target population. Methods include stabilized weights and entropy balancing.
  • Sensitivity Analysis: Perform sensitivity analyses using methods like Rosenbaum bounds, \(E\)-values, and tipping point analysis to evaluate the robustness of findings to potential assumption violations.
  • Multilevel Modeling: Apply hierarchical models such as mixed-effects models, random effects models, and Bayesian hierarchical models to account for clustering and variation at different levels (e.g., individual, group, context).
  • External Validation: Replicate the study in various settings using techniques like cross-validation, split-sample validation, and external dataset validation to confirm the generalizability of the findings.
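As a sketch of the weighting idea, the simulation below generalizes an experimental estimate to a target population via inverse probability weighting. Everything here is invented for illustration: a moderator \(X\) raises both the treatment effect and the chance of entering the experimental sample, and the sampling model is treated as known (in practice it must be estimated, e.g. by logistic regression of sample membership on \(X\)).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative population: moderator X shifts both the treatment effect
# and the probability of being sampled into the experiment (made-up DGP).
N_pop = 100_000
X_pop = rng.normal(0, 1, N_pop)
p_pop = 1 / (1 + np.exp(-(-1.0 + 1.5 * X_pop)))   # P(in sample | X)
in_sample = rng.random(N_pop) < p_pop

X = X_pop[in_sample]
T = rng.integers(0, 2, X.size)
tau_i = 1.0 + 0.8 * X                              # effect grows with X; PATE = 1
Y = 0.2 * X + tau_i * T + rng.normal(0, 1, X.size)

# Unweighted difference in means recovers the SATE, which is biased for the
# PATE because high-X (high-effect) units are oversampled.
sate_hat = Y[T == 1].mean() - Y[T == 0].mean()

# IPW toward the population: weight each sampled unit by 1 / P(in sample | X).
w = 1 / p_pop[in_sample]
pate_hat = (np.average(Y[T == 1], weights=w[T == 1])
            - np.average(Y[T == 0], weights=w[T == 0]))

print(sate_hat, pate_hat)  # the weighted estimate moves toward the PATE of 1
```

The same reweighting logic underlies the stabilized-weight and entropy-balancing variants mentioned above; they differ in how the weights are constructed and smoothed, not in the target they aim at.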

References

Angrist, Joshua D, and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Athey, Susan, Dean Eckles, and Guido W Imbens. 2018. “Exact p-Values for Network Interference.” Journal of the American Statistical Association 113 (521): 230–40.
Blair, Graeme, Alexander Coppock, and Macartan Humphreys. 2023. Research Design in the Social Sciences: Declaration, Diagnosis, and Redesign. Princeton University Press.
Bloom, Howard S. 1995. “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review 19 (5): 547–56.
Cornfield, Jerome. 1978. “Symposium on CHD Prevention Trials: Design Issues in Testing Life Style Intervention: Randomization by Group: A Formal Analysis.” American Journal of Epidemiology 108 (2): 100–102.
Egami, Naoki, and Erin Hartman. 2023. “Elements of External Validity: Framework, Design, and Analysis.” American Political Science Review 117 (3): 1070–88.
Fisher, Ronald Aylmer. 1936. “Design of Experiments.” British Medical Journal 1 (3923): 554.
Freedman, David A. 2008. “On Regression Adjustments to Experimental Data.” Advances in Applied Mathematics 40 (2): 180–93.
Gerber, Alan S, Donald P Green, and Christopher W Larimer. 2008. “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment.” American Political Science Review 102 (1): 33–48.
Ho, Daniel E, and Kosuke Imai. 2006. “Randomization Inference with Natural Experiments: An Analysis of Ballot Effects in the 2003 California Recall Election.” Journal of the American Statistical Association 101 (475): 888–900.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Lei, Lihua, and Peng Ding. 2021. “Regression Adjustment in Completely Randomized Experiments with a Diverging Number of Covariates.” Biometrika 108 (4): 815–28.
Li, Xinran, and Peng Ding. 2017. “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference.” Journal of the American Statistical Association 112 (520): 1759–69.
Lin, Winston. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1): 295–318.
Offer-Westort, Molly, Alexander Coppock, and Donald P Green. 2021. “Adaptive Experimental Design: Prospects and Applications in Political Science.” American Journal of Political Science 65 (4): 826–44.
Pashley, Nicole E, and Luke W Miratrix. 2022. “Block What You Can, Except When You Shouldn’t.” Journal of Educational and Behavioral Statistics 47 (1): 69–100.
Wantchekon, Leonard. 2003. “Clientelism and Voting Behavior: Evidence from a Field Experiment in Benin.” World Politics 55 (3): 399–422.